feat: Add structured output decoding support for vLLM-spyre #903

R3hankhan123 wants to merge 1 commit into torch-spyre:main
Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre. We also recommend installing prek and configuring it to check your code before every local commit.
```python
# llguidance is not supported on s390x due to endianness issues.
if platform.machine() == "s390x":
    backend = self.vllm_config.structured_outputs_config.backend
    if backend == "guidance":
```
Should this return an error to the user instead?
Before this PR, we always silently ignored structured output requests since we didn't support it at all and didn't want to break tool calling integrations that sometimes requested it. But now that we do support structured output, it might be better to return errors where it's misconfigured so that users know to go fix their deployments. Otherwise we'll be in a mixed state where sometimes structured output works, and sometimes it's not applied.
Now raising RuntimeError instead of a warning
Oh, @R3hankhan123, I'm not sure this can be done here in the model runner, though. Any assertion that happens here generally crashes the server. We would need to return an error for this request only, and the easiest way to do that is to validate the request before it even hits the engine, which we can do in SpyrePlatform.validate_request
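A minimal sketch of that up-front validation, assuming a stub in place of vLLM's real sampling params type (`SamplingParamsStub` and the `llguidance_installed` flag are illustrative, not the actual vLLM-Spyre API):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SamplingParamsStub:
    # Illustrative stand-in for vLLM's SamplingParams; only the field
    # relevant to structured output is modeled here.
    guided_decoding: object | None = None


def validate_request(params: SamplingParamsStub,
                     llguidance_installed: bool) -> None:
    # Raising here rejects only the offending request; an assertion
    # deeper in the model runner would crash the whole server instead.
    if params.guided_decoding is not None and not llguidance_installed:
        raise ValueError(
            "structured output requested but llguidance is not installed")
```

Because this runs before the request reaches the engine, the caller gets a per-request error response rather than an engine crash.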
bot:test
joerunde left a comment
A few things we need to clean up before merging:
- We should probably check for the existence of llguidance rather than doing a platform check on s390x, so that we can be forward-compatible if a version of guidance is released that works on Z
- We should validate the structured outputs config up-front when we validate the request in platform.py to avoid setting the guidance backend when llguidance is not installed
- We should validate that the model is not deployed with guidance as the default structured output backend if llguidance is not installed
- We should add tests for these validation cases, as well as one small test that calls a model with a structured output option so we can ensure it actually works
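The first two points above could be sketched like this; the function names are hypothetical, but `importlib.util.find_spec` is the standard way to probe for an installed package without importing it:

```python
import importlib.util


def llguidance_available() -> bool:
    # Capability check instead of a platform check: a future llguidance
    # release that works on s390x would then be picked up automatically.
    return importlib.util.find_spec("llguidance") is not None


def validate_backend(backend: str) -> None:
    # Hypothetical up-front validation: reject the guidance backend when
    # llguidance is not installed, instead of failing later at runtime.
    if backend == "guidance" and not llguidance_available():
        raise RuntimeError(
            "structured outputs backend 'guidance' requires llguidance, "
            "which is not installed")
```

The same check can be reused both for per-request validation and for rejecting a deployment whose default backend is guidance.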
@joerunde llguidance version 1.7.3 contains the fix, and I have added it in the overrides of pyproject. Can you take a look? Thanks
@R3hankhan123 let's update the PR description to remove the bit about llguidance not being supported on s390x. Also, before merging we still need to encode the examples from the description into unit tests. We should probably flex all the different backends, and be sure that a prompt in the batch that doesn't request structured outputs does not accidentally have structured outputs applied. See
```python
# Verify output is valid JSON
try:
    json_obj = json.loads(output_text)
    assert isinstance(json_obj, dict), "Output should be a JSON object"
```
I'm skeptical that the micro models we use for testing will actually always generate an object, but if this passes on Spyre for both fp16 and fp8 then it's probably okay.
```python
# Generate with structured output
output_structured = spyre_model.generate([prompt_structured], [params_structured])[0]
```
These need to be a single request, like

```python
output = spyre_model.generate([prompt_structured, prompt_freeform], [params_structured, params_freeform])
```

The point is to ensure that requests running within the same batch can have different structured output settings.
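The self-contained part of that check — asserting the structured output matches the schema while the freeform output is left unconstrained — might look like this (the `name`/`age` fields come from the schema in the test under review; the stand-in outputs at the bottom are illustrative):

```python
import json


def check_mixed_batch(structured_text: str, freeform_text: str) -> None:
    # The structured request must decode to a JSON object carrying the
    # schema fields; the freeform request simply must not be empty
    # (it is allowed, but not required, to be valid JSON).
    obj = json.loads(structured_text)
    assert isinstance(obj, dict), "structured output should be a JSON object"
    assert "name" in obj and "age" in obj, "schema fields missing"
    assert freeform_text.strip(), "freeform output should not be empty"


# Stand-in outputs, as if both prompts ran in one batch:
check_mixed_batch('{"name": "Ada", "age": 36}', "a free-form answer")
```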
@joerunde using this, the e2e tests are failing with:

```
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] raise self._exception
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] return func(*args, **kwargs)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 332, in execute_model
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] return self.worker.execute_model(scheduler_output)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] return func(*args, **kwargs)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm_spyre/v1/worker/spyre_worker.py", line 770, in execute_model
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] output = self.model_runner.execute_model(scheduler_output)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] return func(*args, **kwargs)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm_spyre/v1/worker/spyre_model_runner.py", line 1541, in execute_model
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] self.maybe_setup_new_prefill(scheduler_output)
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] File "/home/runner/work/sendnn-inference/sendnn-inference/.venv/lib/python3.12/site-packages/vllm_spyre/v1/worker/spyre_model_runner.py", line 1478, in maybe_setup_new_prefill
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] assert len(scheduler_output.scheduled_new_reqs) == 1, (
(EngineCore pid=3614) ERROR 04-22 17:44:37 [core.py:1110] AssertionError: Can only schedule one chunked prefill at a time
```
```python
assert "name" in json_obj, "Output should have 'name' field"
assert "age" in json_obj, "Output should have 'age' field"
assert isinstance(json_obj["name"], str), "'name' should be a string"
assert isinstance(json_obj["age"], int), "'age' should be an integer"
```
```
@@ -0,0 +1,355 @@
"""End-to-end tests for structured output decoding.
```
The tests here in general are pretty great!

There's one big problem here though: we're not using cached models, which will really slow down the testing. On Spyre, if we don't use the model cache, it takes ~30s to reload a new model (for the g3.3-micro that we test with). The choices here are either:

1. Consolidate all the cases into a single test, so we load the model once with each backend and send a few batches of prompts to it
2. Add the `structured_output_backend` as a model cache key and use the cached LLM with that backend

I'd prefer (2); it should be pretty quick to get that hooked up in the model cache. In our CI runs we also have the benefit of running all the tests with a cached LLM in a single process instead of forking the tests, which is much faster as well.
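Option (2) essentially amounts to widening the cache key. A minimal sketch, where the names `_llm_cache` and `get_cached_llm` are illustrative rather than the actual vllm-spyre test helpers:

```python
# Module-level cache keyed by everything that determines the LLM instance.
_llm_cache: dict = {}


def get_cached_llm(model: str, structured_output_backend=None):
    # Include the backend in the key so LLMs configured with different
    # structured output backends are cached (and reused) separately.
    key = (model, structured_output_backend)
    if key not in _llm_cache:
        _llm_cache[key] = object()  # stand-in for constructing the LLM
    return _llm_cache[key]
```

Two test cases asking for the same model and backend then share one loaded LLM, while a case with a different backend triggers exactly one extra load.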
@joerunde I have added code for option 2. Can you review it and let me know if it's the correct approach? Thanks
@R3hankhan123 there's a sort-key class that probably needs to be updated as well; it needs to sort these test cases so that cases with identical LLMs are grouped together.
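That grouping could be done with a tuple sort key over the fields that determine the LLM instance; the dict-based test cases and field names here are illustrative, not the suite's actual case representation:

```python
def llm_sort_key(case: dict) -> tuple:
    # Sort so cases sharing the same model + backend run consecutively,
    # letting the cached LLM be reused without a ~30s reload in between.
    return (case["model"], case.get("structured_output_backend") or "")


cases = [
    {"model": "g3.3-micro", "structured_output_backend": "xgrammar"},
    {"model": "g3.3-micro", "structured_output_backend": None},
    {"model": "g3.3-micro", "structured_output_backend": "xgrammar"},
]
cases.sort(key=llm_sort_key)
```

After the sort, the two `xgrammar` cases are adjacent, so the model is loaded at most once per distinct (model, backend) pair.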
Add support for structured decoding/structured output for sendnn-inference. Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
It also looks like there's a test failure with the mixed batch, so something is wrong and we've violated the constraint that we can only prefill a single request at a time.
Description
Add structured decoding support for vLLM-Spyre.
Test Plan
Run the vLLM server and set the structured output backend to xgrammar and outlines.
Test output
With outlines
Checklist

- [ ] Code formatted (`bash format.sh`)
- [ ] Commits include a `Signed-off-by:` line (DCO compliance)